A Novel Method for the Absolute Pose Problem with Pairwise Constraints
Absolute pose estimation is a fundamental problem in computer vision, and it
is a typical parameter estimation problem, meaning that efforts to solve it
will always suffer from outlier-contaminated data. Conventionally, for a fixed
dimensionality d and the number of measurements N, a robust estimation problem
cannot be solved faster than O(N^d). Furthermore, it is almost impossible to
remove d from the exponent of the runtime of a globally optimal algorithm.
However, absolute pose estimation is a geometric parameter estimation problem,
and thus has special constraints. In this paper, we consider pairwise
constraints and propose a globally optimal algorithm for solving the absolute
pose estimation problem. The proposed algorithm has a linear complexity in the
number of correspondences at a given outlier ratio. Concretely, we first
decouple the rotation and the translation subproblems by utilizing the pairwise
constraints, and then we solve the rotation subproblem using the
branch-and-bound algorithm. Lastly, we estimate the translation based on the
known rotation by using another branch-and-bound algorithm. The advantages of
our method are demonstrated via thorough testing on both synthetic and
real-world data.
Comment: 10 pages, 7 figures
Learnable MFCCs for Speaker Verification
We propose a learnable mel-frequency cepstral coefficient (MFCC) frontend
architecture for deep neural network (DNN) based automatic speaker
verification. Our architecture retains the simplicity and interpretability of
MFCC-based features while allowing the model to be adapted to data flexibly. In
practice, we formulate data-driven versions of the four linear transforms of a
standard MFCC extractor -- windowing, discrete Fourier transform (DFT), mel
filterbank and discrete cosine transform (DCT). Results reported reach up to
6.7% (VoxCeleb1) and 9.7% (SITW) relative improvement in terms of equal error
rate (EER) from static MFCCs, without additional tuning effort.
Comment: Accepted to ISCAS 202
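For orientation, the four linear transforms named in the abstract above can be written as explicit arrays in a static MFCC extractor. The following is a minimal NumPy sketch of a conventional (non-learnable) pipeline, not the authors' implementation. The function names, the Hann window, the HTK-style mel formula, and the unnormalized DCT-II are all assumptions chosen for illustration. In a learnable frontend of the kind described, such arrays would presumably serve as initial values of trainable parameters.

```python
import numpy as np

def hann_window(n_fft):
    # Transform 1: windowing (element-wise multiplication, i.e. a diagonal matrix).
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n_fft) / n_fft)

def dft_matrix(n_fft):
    # Transform 2: DFT restricted to non-negative frequencies (n_fft // 2 + 1 rows).
    k = np.arange(n_fft // 2 + 1)[:, None]
    n = np.arange(n_fft)[None, :]
    return np.exp(-2j * np.pi * k * n / n_fft)

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    # Transform 3: triangular filters with centers equally spaced on the mel scale.
    freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    edges = mel_to_hz(mel_edges)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def dct_matrix(n_ceps, n_mels):
    # Transform 4: unnormalized DCT-II basis.
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n_mels)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n_mels))

def mfcc_frame(frame, window, dft, fb, dct, eps=1e-10):
    # The squared magnitude and the logarithm are the only nonlinear steps.
    spectrum = dft @ (window * frame)
    power = np.abs(spectrum) ** 2
    mel_energy = fb @ power
    return dct @ np.log(mel_energy + eps)

if __name__ == "__main__":
    n_fft, sample_rate, n_mels, n_ceps = 512, 16000, 40, 20
    rng = np.random.default_rng(0)
    frame = rng.standard_normal(n_fft)  # stand-in for one frame of speech samples
    coeffs = mfcc_frame(
        frame,
        hann_window(n_fft),
        dft_matrix(n_fft),
        mel_filterbank(n_mels, n_fft, sample_rate),
        dct_matrix(n_ceps, n_mels),
    )
    print(coeffs.shape)
```

Keeping each stage as a separate array makes it clear which parameters a data-driven variant could adapt while the overall structure stays interpretable.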
A Comparative Re-Assessment of Feature Extractors for Deep Speaker Embeddings
Modern automatic speaker verification relies largely on deep neural networks
(DNNs) trained on mel-frequency cepstral coefficient (MFCC) features. While
there are alternative feature extraction methods based on phase, prosody and
long-term temporal operations, they have not been extensively studied with
DNN-based methods. We aim to fill this gap by providing an extensive re-assessment
of 14 feature extractors on the VoxCeleb and SITW datasets. Our findings reveal
that features equipped with techniques such as spectral centroids, group delay
function, and integrated noise suppression provide promising alternatives to
MFCCs for deep speaker embeddings extraction. Experimental results demonstrate
up to 16.3% (VoxCeleb) and 25.1% (SITW) relative decrease in equal error rate
(EER) compared to the baseline.
Comment: Accepted to Interspeech 202
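The abstracts in this listing report results as relative changes in equal error rate. As a reminder of what that metric measures, here is a minimal sketch of an EER computation from trial scores, plus the relative-reduction formula. The function names and the simple threshold sweep are assumptions for illustration. Evaluation toolkits typically interpolate between operating points and handle tied scores more carefully.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the operating point where miss rate and false-alarm rate meet."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)), np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]
    # Entry i corresponds to a threshold placed just above the i lowest scores:
    # targets below the threshold are misses, non-targets above it are false alarms.
    miss = np.concatenate([[0.0], np.cumsum(labels) / len(target_scores)])
    false_alarm = np.concatenate([[1.0], 1.0 - np.cumsum(1.0 - labels) / len(nontarget_scores)])
    idx = np.argmin(np.abs(miss - false_alarm))
    return 0.5 * (miss[idx] + false_alarm[idx])

def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent, with respect to the baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer

if __name__ == "__main__":
    # Hypothetical toy scores, not taken from any of the papers listed here.
    targets = [2.0, 3.0, 1.5, 0.2]
    nontargets = [0.0, 1.0, -0.5, 0.4]
    print(equal_error_rate(targets, nontargets))
```

A relative reduction is computed against the baseline EER, so the same absolute gain yields a larger relative figure when the baseline error is already low.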
Learnable MFCCs for Speaker Verification
We propose a learnable mel-frequency cepstral coefficient (MFCC) front-end architecture for deep neural network (DNN) based automatic speaker verification. Our architecture retains the simplicity and interpretability of MFCC-based features while allowing the model to be adapted to data flexibly. In practice, we formulate data-driven versions of the four linear transforms in a standard MFCC extractor: windowing, discrete Fourier transform (DFT), mel filterbank and discrete cosine transform (DCT). Results reported reach up to 6.7% (VoxCeleb1) and 9.7% (SITW) relative improvement in terms of equal error rate (EER) from static MFCCs, without additional tuning effort. Index Terms: speaker verification, feature extraction, mel-frequency cepstral coefficients (MFCCs)
Learnable Nonlinear Compression for Robust Speaker Verification
In this study, we focus on nonlinear compression methods for spectral features in speaker verification based on deep neural networks. We consider different kinds of channel-dependent (CD) nonlinear compression methods optimized in a data-driven manner. Our methods are based on power nonlinearities and dynamic range compression (DRC). We also propose a multi-regime (MR) design for the nonlinearities, aimed at improving robustness. Results on VoxCeleb1 and VoxMovies data demonstrate improvements brought by the proposed compression methods over both the commonly used logarithm and their static counterparts, especially for those based on the power function. While CD generalization improves performance on VoxCeleb1, MR provides more robustness on VoxMovies, with a maximum relative equal error rate reduction of 21.6%.
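To illustrate the difference between the logarithm and a channel-dependent power nonlinearity, the sketch below applies both to a matrix of mel filterbank energies. This is an assumed, simplified form and not the paper's parametrization: the function names and exponent values are hypothetical, and in a learnable setting the per-channel exponents would be trained rather than fixed.

```python
import numpy as np

def log_compress(mel_energy, eps=1e-10):
    # Conventional static compression: the same logarithm for every channel.
    return np.log(mel_energy + eps)

def power_compress(mel_energy, exponents):
    # Power-law compression. A scalar exponent gives a static nonlinearity;
    # a vector of shape (n_channels,) gives one exponent per filterbank channel
    # and broadcasts over the frame axis.
    return np.power(mel_energy, exponents)

if __name__ == "__main__":
    # Hypothetical energies with shape (n_frames, n_channels).
    mel_energy = np.array([[4.0, 8.0], [16.0, 27.0]])
    static = power_compress(mel_energy, 0.5)
    channel_dependent = power_compress(mel_energy, np.array([0.5, 1.0 / 3.0]))
    print(static)
    print(channel_dependent)
    print(log_compress(mel_energy))
```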
Spoofing-Aware Speaker Verification with Unsupervised Domain Adaptation
In this paper, we address the problem of enhancing the spoofing robustness of an automatic speaker verification (ASV) system without relying primarily on a separate countermeasure module. We start from the standard ASV framework of the ASVspoof 2019 baseline and approach the problem from the back-end classifier, which is based on probabilistic linear discriminant analysis. We employ three unsupervised domain adaptation techniques to optimize the back-end using the audio data in the training partition of the ASVspoof 2019 dataset. We demonstrate notable improvements in both logical and physical access scenarios, especially in the latter, where the system is attacked by replayed audio, with a maximum of 36.1% and 5.3% relative improvement on bonafide and spoofed cases, respectively. We perform additional studies such as per-attack breakdown analysis, data composition, and score-level integration with a countermeasure system using a Gaussian back-end.
Optimized Power Normalized Cepstral Coefficients Towards Robust Deep Speaker Verification
After their introduction to robust speech recognition, power normalized cepstral coefficient (PNCC) features were successfully adopted for other tasks, including speaker verification. However, as a feature extractor with long-term operations on the power spectrogram, its temporal processing and amplitude scaling steps dedicated to environmental compensation may be redundant. Further, they might suppress intrinsic speaker variations that are useful for speaker verification based on deep neural networks (DNNs). Therefore, in this study, we revisit and optimize PNCCs by ablating their medium-time processor and by introducing channel energy normalization. Experimental results with a DNN-based speaker verification system indicate substantial improvement over baseline PNCCs in both in-domain and cross-domain scenarios, reflected by maximum relative equal error rate reductions of 5.8% and 61.2% on VoxCeleb1 and VoxMovies, respectively.